
TEP-0015 - Add a pending setting to Tekton PipelineRun and TaskRuns #203

Merged
merged 1 commit into from
Nov 12, 2020

Conversation

jbarrick-mesosphere
Contributor

This TEP proposes a method for pausing PipelineRuns and TaskRuns that are then
ignored by the Tekton reconciler. These paused resources can be used by platform
implementers to control PipelineRun and TaskRun resources, for example, in cases
where the cluster is under heavy load and not ready to process a PipelineRun.

PipelineRuns and TaskRuns can be created paused so that they are not started at
all or, in the case of a PipelineRun, the pause setting can be used to prevent
further TaskRuns from being created. As it is not possible to "pause" a
Kubernetes pod, the pause setting would not affect TaskRuns after they are
started.
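To make the proposed API concrete, a created-but-not-started run might look like the following. This is an illustrative sketch: the field name `spec.status: "PipelineRunPending"` follows the naming the thread later settles on, and the `pipelineRef` name is hypothetical.

```yaml
# Illustrative sketch of the proposed API (assumes the "PipelineRunPending"
# value discussed later in this thread; pipelineRef name is made up).
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: build-and-test
spec:
  status: "PipelineRunPending"   # reconciler will not create TaskRuns yet
  pipelineRef:
    name: build-and-test-pipeline
```

A platform controller would later clear `spec.status` to let the run start, mirroring how cancellation is expressed through the same field.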

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 10, 2020
The primary motivation for this is to support operators who need to control the
startup of PipelineRuns and TaskRuns in clusters that are under heavy load.
By allowing PipelineRuns to be created without actually starting them, one
could implement a custom controller to apply any PipelineRun scheduling
Member

One can already implement custom scheduling today at the Kubernetes level, which would let you prioritize Pod startup, which is roughly equivalent to TaskRun startup.

I think I'd want to see more motivation about why we should implement this at the Tekton level as opposed to at the Kubernetes level, where there's already been a ton of work done to support basically anything you might want.

Contributor Author

We do want to use the advanced scheduling options in Kubernetes, in addition to some other pipeline-aware features.

My primary use-case here is to allow us to create a parallelism limit for PipelineRuns. We will also utilize pod priority to order the tasks in the pipelines that do run, which will help somewhat (also this is not something we can do until Kubernetes 1.19, as currently it is not possible to disable pre-emption). But ordering of pods that run and limiting the actual pipelines that are submitted to the scheduler are separate use-cases.

The issue that we are trying to solve is that when you create a new PipelineRun, if there are many other PipelineRuns with complex pipelines, you can have many PipelineRuns competing over the same resources, saturating the scheduler and causing individual tasks to take a very long time to start.

For example, if we were to use ResourceQuotas to limit the number of pods in a namespace, we only limit the number of tasks that can run simultaneously, not the number of pipelines.

So we want to be able to limit the number of PipelineRuns that are started at any given time. We do not want to implement the actual parallelism limit at the Tekton level because we want to have custom limits for the PipelineRuns (e.g., based on the namespace or for a particular repository's pipelines), so this feature allows us to submit inert PipelineRuns and then have Tekton act on them once we're ready.

Contributor Author
@jbarrick-mesosphere jbarrick-mesosphere Sep 11, 2020

Perhaps this design is more complex than I intended and kind of obfuscates my intentions. The only feature that I need here is to be able to create inert PipelineRuns and start them later. The rest about pausing TaskRuns and supporting pausing after the PipelineRun has already been started is only because those features seemed natural to include at the same time for consistency/least surprise.

Member
@imjasonh imjasonh Sep 11, 2020

Yeah, let's start with the minimal use case you want to add, then when we reach agreement about it we can discuss how to extend it. Going the other way can lead to unused features that were "easy to add" but add support burden.

My main question is: Why not have whatever system is creating-and-unpausing PipelineRuns just hold off on creating them until they're safe to run? TaskRuns/Pods that don't have schedulable resources will already wait until resources become available.

Contributor Author

Okay cool, I'll slim the scope of this document and update it.

My main question is: Why not have whatever system is creating-and-unpausing PipelineRuns just hold off on creating them until they're safe to run? TaskRuns/Pods that don't have schedulable resources will already wait until resources become available.

I tried to address this in the alternatives section, but basically this would require us to have a non-Kubernetes-native data store (say, Redis) that would be a lot harder for our users to have visibility into or to create a CRD that is an abstraction over the PipelineRun, which we are hesitant to do since it would add quite a bit of complexity.

Member

Like I said previously, we need to be able to do our own custom limits, not related strictly to PipelineRun

What kinds of limits are you considering? It sounded like from #203 (comment) it was user-defined concurrency limits (I realize now this presumably means end-users, not Operators, that changes requirements a bit)

If this is supported already I'm certainly happy to use it.

This is actually possible today, I think. You could have a mutating admission webhook force new PipelineRuns into the Succeeded=Unknown state (so Tekton doesn't try to start them immediately) and a reconciling controller with permission to update PipelineRun statuses to push them into a running state, at which point Tekton's normal reconciliation would take over and start any TaskRuns.

This should be prototypeable today, though I don't think anybody's done anything like it yet. If you're interested in exploring it I'd be happy to help, and if you find any bugs while investigating I'd be happy to fix them.

This is a disadvantage not a benefit.

Eye of the beholder I suppose 😆 . The thought of multiple systems separately judging the same pool of PipelineRuns seems like a real mess to me.

Contributor Author

This is actually possible today, I think. You could have a mutating admission webhook force new PipelineRuns into the Succeeded=Unknown state (so Tekton doesn't try to start them immediately) and a reconciling controller with permission to update PipelineRun statuses to push them into a running state, at which point Tekton's normal reconciliation would take over and start any TaskRuns.

Yeah, I'll try this out and report back. Thanks!

Contributor

What kinds of limits are you considering? It sounded like from #203 (comment) it was user-defined concurrency limits (I realize now this presumably means end-users, not Operators, that changes requirements a bit)

Depends on use cases, the granularity of applying the limits may not be aligned with namespaces. For example, some may want to set concurrency limits per service account or per "repository."

Contributor Author

It doesn't look like it supports this status yet as it immediately runs:

status:
  conditions:
  - lastTransitionTime: "2020-09-15T18:15:29Z"
    message: 'Not started yet.'
    reason: ConcurrencyCapped
    status: "Unknown"
    type: Succeeded

I went ahead and narrowed the scope of the document to just PipelineRuns. I didn't change the API interface yet, but if everyone prefers it as a condition I am happy to change it to that.

Contributor Author

Actually it looks like #3223 uses the PipelineRunPaused setting to tell Tekton to pause, which sets the Status condition to Paused. I'll change the TEP to that.

@imjasonh
Member

Re: Kubernetes' many scheduling options:

There's also tektoncd/pipeline#3052 to discuss customizing Tekton's Pod scheduling, which you could design in such a way to give hooks that could enable someone to implement a "pause"-like functionality.

@bobcatfish
Contributor

Thanks for putting this together and explaining the background @jbarrick-mesosphere ! I have 2 main thoughts:

  1. I don't totally understand what we gain by creating PipelineRuns and TaskRuns ("*Runs") that do not execute - if we added the ability to pause/unpause, I'm not sure what that gives us beyond having some other type and controller that decides when to create PipelineRuns and TaskRuns - i.e. I feel like if you wanted to have another controller deciding when to unpause, why not make that responsible for creating the *Runs, period? Up until this point creating a *Run is the same as saying: execute this now, and I'm hesitant to change that
  2. We recently added concurrency limits (provide a pipeline concurrency limit pipeline#1305) to our roadmap - if the ultimate goal of this proposal is to take a step towards having those kinds of limits, is there any chance you'd be interested in tackling that feature instead?


## Alternatives

* Instead of adding this setting directly to the PipelineRun or TaskRun object,
Member

Spitballing here. This might be an opportunity to make a custom interceptor with something like RabbitMQ. So, when an event is made you can broker it yourself. 🤔

Contributor Author

* PipelineRuns that are not ready to be started could be stored in a
  non-Kubernetes-native queue, such as Redis, and only submitted once they are
  ready to run. This adds complexity in that a new data store must be run and
  it is not easy to provide users visibility into the queue.

Member
@popcor255 popcor255 Sep 14, 2020

Sorry, I should clarify. Tekton Triggers can take events and map them to PipelineRuns. There's a list of interceptors that can manipulate payloads and do extra processing. I agree this approach isn't the best solution. I just think it would be easier to schedule the events rather than the pipelines/tasks themselves.

Member

I was just wondering if this is an event issue. 😅

Member
@popcor255 popcor255 Sep 14, 2020

This is actually possible today, I think. You could have a mutating admission webhook force new PipelineRuns into the Succeeded=Unknown state (so Tekton doesn't try to start them immediately) and a reconciling controller with permission to update PipelineRun statuses to push them into a running state, at which point Tekton's normal reconciliation would take over and start any TaskRuns.

This is actually just a variation on this comment. nvm. lol

@jbarrick-mesosphere
Contributor Author

1. I don't totally understand what we gain by creating PipelineRuns and TaskRuns ("*Runs") that do not execute - if we added the ability to pause/unpause, I'm not sure what that gives us beyond having some other type and controller that decides when to create PipelineRuns and TaskRuns - i.e. I feel like if you wanted to have another controller deciding when to unpause, why not make that responsible for creating the *Runs, period? Up until this point creating a *Run is the same as saying: execute this now, and I'm hesitant to change that

PipelineRuns encode information that the controller handling concurrency doesn't need to be aware of, and as I pointed out in the alternatives, we would either have to add new types or introduce a new datastore to accommodate that; both options add unneeded complexity and make things harder for the user from a UI perspective.

2. We recently added concurrency limits ([tektoncd/pipeline#1305](https://github.com/tektoncd/pipeline/issues/1305)) to [our roadmap](https://github.com/tektoncd/pipeline/blob/master/roadmap.md#feature-complete) - if the ultimate goal of this proposal is to take a step towards having those kinds of limits, is there any chance you'd be interested in tackling that feature instead?

Our concurrency limit requirements are more complex than should be implemented in Tekton itself, so we need an integration point.


This TEP proposes a method for pausing PipelineRuns and TaskRuns that are then
ignored by the Tekton reconciler. These paused resources can be used by platform
implementers to control PipelineRun and TaskRun resources, for example, in cases
Member

In my case, I need the pause status so that the user can do some manual work after a certain task has been completed. For example, once the preprocessing task is complete, we'd like to pause the pipeline, do some label work manually, and after that we can cancel the pause status and let the pipeline run.

Member

My case is quite different from @jbarrick-mesosphere 's and my PR is based on this "manual work needed" situation 🤔


PipelineRuns and TaskRuns can be created paused so that they are not started at
all or, in the case of a PipelineRun, the pause setting can be used to prevent
further TaskRuns from being created. As it is not possible to "pause" a
Member
@FogDong FogDong Sep 15, 2020

I think we only need pause in PipelineRun, not TaskRun, since runAfter is only available in PipelineRun. Once a PipelineRun is paused, after the completion of the currently running task, the next task that runs after it will be paused, but other tasks won't be affected.


## Alternatives

* Instead of adding this setting directly to the `PipelineRun` object,
Member

Will there be any problems if the client-go requeue operation is used here?

Contributor Author

Sorry, not sure I understand the question


### Non-Goals

* Support pausing of a running `TaskRun` as that would require pausing the
Member

ref #issue/pipeline2217

If we really need to pause a TaskRun:

We can watch the resource status of the TaskRun, just like the design for PipelineRun; if taskRunPaused is set, it will be suspended.
The process would be:
watch the TaskRun ---> find the node where the TaskRun's Pod is ---> find the step's containerID ---> find the processID by containerID ---> use a DaemonSet to execute "kill -STOP <processID>".

It is very hacky.

What do you think of this idea?

Contributor Author
@jbarrick-mesosphere jbarrick-mesosphere Sep 16, 2020

Yeah that sounds very hacky and I don't have a use-case for that feature myself.

Contributor

@chhsia0 has mentioned renaming this particular TEP to something like "Pending" instead of "paused" - if we want to pursue a "pause" feature maybe a separate TEP would make sense?

@withlin
Member

withlin commented Sep 16, 2020

I think we only need pause in PipelineRun, not TaskRun, since runAfter is only available in PipelineRun. Once a PipelineRun is paused, after the completion of the currently running task, the next task that runs after it will be paused, but other tasks won't be affected.

I agree with @FogDong's idea; we should only pause the PipelineRun.

BTW, if you want to add other advanced features, you should open another PR for them; otherwise I think this one will become very complicated.

@vdemeester vdemeester added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Sep 17, 2020
@FogDong
Member

FogDong commented Sep 22, 2020

ping /cc @imjasonh @bobcatfish

@chhsia0
Contributor

chhsia0 commented Oct 26, 2020

The purpose of this PR is to provide a way for a custom controller to decide when to start a PipelineRun, so users could implement arbitrary scheduling policies. Both this PR and #222 are trying to resolve resource contention in the cluster. It doesn't aim at pausing a running PipelineRun. If there's a need for that, let's open another TEP for discussion.

In today's API WG, no one opposed this direction, but the term "Paused" is probably a bit misleading. Let's change it to another name, e.g., "Pending" maybe?

Contributor
@bobcatfish bobcatfish left a comment

I'm generally in favor of this! And I like @chhsia0 's suggestion to rename it to something like "pending"

I have 1 question: what happens if someone tries to set this value once execution of the Run has started? (I'm thinking we should fail that in the webhook validation, which would prevent this from ever happening)

- "@jbarrick-mesosphere"
creation-date: 2020-09-10
last-updated: 2020-09-10
status: draft
Contributor

usually these start as "proposed" or go straight to "implementable"

we can definitely merge as "proposed" pretty quickly since we've discussed this a bunch, if you want to go straight to "implementable" maybe we can get a couple more folks to review @tektoncd/core-maintainers


### Non-Goals

* Support pausing of a running `TaskRun` as that would require pausing the
Contributor

@chhsia0 has mentioned renaming this particular TEP to something like "Pending" instead of "paused" - if we want to pursue a "pause" feature maybe a separate TEP would make sense?

* `PipelineRuns` that are not ready to be started could be stored in a
non-Kubernetes-native queue, such as Redis, and only submitted once they are
ready to run. This adds complexity in that a new data store must be run and
it is not easy to provide users visibility into the queue.
Contributor

another alternative we could add here is that instead of adding a field to spec.status we add a new condition type, e.g. in addition to the condition "succeeded" we have a condition "pending" or something

I think the current proposal is better because it reflects the purpose of spec vs. status:

  • spec = the desired behavior (e.g. don't actually run it)
  • status = what is actually happening (e.g. it is not running)

Member

I agree with the status Pending.

Actually, the way I implement Paused is to add PipelinePaused to spec.status, just like the way PipelineCancelled is implemented. Ref: tektoncd/pipeline#3223

Maybe I should write another TEP and change the name from Pause to Pending in my PR.

Contributor

@FogDong tho they are similar I see these as 2 fairly different sets of functionality:

  1. Pausing Runs = no matter what state the Run currently is in, stop it and don't allow it to progress further (which I think is what Feat: Add pause condition in pipeline run pipeline#3223 is about?)
  2. Introduce a way to control initial scheduling of runs = Introduce a way to "freeze" Runs before they execute and make it possible for an external system to control when Runs are actually executed (which is what this "pending" TEP is about)

I would see having two different "levers" (e.g. 2 different values in spec.status) to control this. Or even if we use the same spec.status value for both, I think we'd likely approve and move forward with the proposal here, which has a much smaller scope.

To move forward with the pausing functionality we'll probably need to discuss it in a further TEP - and especially to understand the use cases around this. (Note that a "pause" of sorts could be introduced via a custom task, TEP-0002.)

@jbarrick-mesosphere
Contributor Author

Thanks all, I have updated the TEP to reflect our "pending" setting instead of "pause". I also added a note that the validating webhook should reject updates that set pending after the PipelineRun has started.

Contributor
@bobcatfish bobcatfish left a comment

The proposal looks reasonable to me! I'm happy to approve from my perspective - need to get some more non-google approvals as well if possible @tektoncd/core-maintainers

/approve


If the `PipelineRunPending` setting is set after a PipelineRun has started, the
validating webhook should reject the update.
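To make the rejection rule concrete, here is a minimal sketch of the check in Python. This is purely illustrative: the real webhook lives in Tekton's Go admission controller, and the helper name and dict shapes here are assumptions, not Tekton's actual API.

```python
# Hypothetical sketch: reject marking a PipelineRun pending once it has
# started. "Started" is approximated here by the presence of
# status.startTime, which Tekton records when execution begins.

def validate_update(old_run: dict, new_run: dict) -> None:
    """Raise if the update sets PipelineRunPending on an already-started run."""
    wants_pending = new_run.get("spec", {}).get("status") == "PipelineRunPending"
    already_started = old_run.get("status", {}).get("startTime") is not None
    if wants_pending and already_started:
        raise ValueError(
            "invalid update: cannot mark a PipelineRun pending once it has started"
        )
```

Creating a run with the pending status, or clearing it later, passes this check; only the started-then-pending transition is refused.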

Contributor

it might be worth being explicit around how this state relates to the PipelineRun timing out, i.e. does the timeout start counting down while the PipelineRun is still pending or only once it starts running? (I'm guessing it doesn't start until the PipelineRun actually starts running?)

Contributor

pinging this - I'd like to be explicit about how this interacts with timeouts in the proposal 🙏

Contributor Author

Timeouts use the .status.startTime field which is set once the PipelineRun has actually started running, not the PipelineRun creation time, so there is no impact. I have added a note. This is ready for merge now.
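The relationship can be sketched as follows (illustrative Python, not Tekton's Go code): because the deadline is computed from `status.startTime`, time spent pending never counts against the timeout.

```python
from datetime import datetime, timedelta
from typing import Optional

# Illustrative: the timeout clock is anchored to status.startTime, which
# is only set when the PipelineRun actually begins executing. A pending
# run has no startTime, so it has no deadline yet.

def timeout_deadline(start_time: Optional[datetime],
                     timeout: timedelta) -> Optional[datetime]:
    if start_time is None:  # still pending: no countdown
        return None
    return start_time + timeout
```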

@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 2, 2020

To start a PipelineRun, the operator should set the `Status` field to empty;
the Tekton controller will see the change and proceed with the Pipeline as
normal. Upon starting, the `Succeeded` `Condition` `Reason` will be changed to
Member

Can we add more details about what "normal" means here?

What is the reconciler doing after resuming? Is there any specific validation the reconciler needs to implement/check when resuming?

Cancellation is an easy job: cancel the task, update the status, and be done. But I am guessing resuming is not. What are the possible pipelineRun states from which a pipelineRun can go into pending mode?

Contributor Author

Alright, I have tried to clarify this in the document, let me know if it helps.

Member

thanks Justin, it's not clear to me but that's fine, it's just me. Please squash the commits before we merge. We can discuss further in the implementation PR.

I am having a hard time understanding when I can send a pipelineRun into pending mode. As soon as a pipelineRun is instantiated, it initializes pr.Status.Conditions with the status set to unknown and the reason to started, along with the start time. Can a started pipeline be assigned the pending state?

I am assuming here the answer is no, since the pipeline is already marked as started and has started validating resources.

After all the validation is done, pr.Status is set to running from started. How does the webhook validate that the pipelineRun has not already been created/started running?

When the status field is set to empty (to resume from the pending state), what state is assigned to pr.Status? started, running?

Contributor

@pritidesai if this helps, the webhook will validate before the reconciler does anything

the sequence of events that i understand will be (e.g. imagining you're doing something like kubectl create/apply -f pipelinerun.yaml or a tool is doing the equivalent):

  1. make call to API server to create PipelineRun
  2. API server calls out to any validating/mutating webhooks
  3. If the webhooks indicate the PipelineRun is invalid, the creation fails and an error is returned to the user
  4. API server stores the pipelinerun in etcd
  5. API server notifies any watching connections of the creation of the new pipelinerun
  6. pipeline controller is notified of new pipelinerun, then starts reconciling

At (6), the controller would see the "pending" spec.status and would update the conditions to "unknown pipelinerun is pending"

Later on, you'd actually start the PipelineRun by removing spec.status.pending. At that point this would happen:

  1. make call to API server to modify PipelineRun
  2. API server calls out to any validating/mutating webhook
  3. Removing spec.status.pending is valid, so the mutation is allowed
  4. API server stores the pipelinerun in etcd
  5. API server notifies any watching connections of the modification of the pipelinerun
  6. pipeline controller is notified of modification, then starts reconciling

Then in (6) the controller would see that "pending" had been removed and know to transition from "unknown pipelinerun is pending" to actually starting to execute the pipeline.

After that point, if anyone tried to modify spec.status.pending, the webhook would see in the status of the PipelineRun that it is already running and deny the mutation; the change would never even be stored.
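The two six-step sequences above boil down to one small decision in the reconciler. A hedged Python sketch (function name and condition strings are illustrative stand-ins for Tekton's knative-style conditions, not the actual Go implementation):

```python
# Illustrative reconciler decision for a pending-aware controller.
# The returned strings are simplified stand-ins for a condition of
# type=Succeeded with status=Unknown and a reason field.

def reconcile(pipeline_run: dict) -> str:
    spec_status = pipeline_run.get("spec", {}).get("status")
    if spec_status == "PipelineRunPending":
        # Creation sequence, step 6: record that we are deliberately
        # not starting, and do nothing else this loop.
        return "Succeeded=Unknown, reason=PipelineRunPending"
    # Modification sequence, step 6: pending was removed (or never set),
    # so begin normal execution and create TaskRuns.
    return "Succeeded=Unknown, reason=Running"
```

The webhook check described earlier sits in front of this: by the time the reconciler runs, any invalid started-then-pending update has already been rejected.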

Member

thanks a bunch @bobcatfish for the walkthrough 😻

so between 5 and 6, the process watching these connections for the creation of a new pipelineRun would have to update the conditions to pending; looking at the controller logs (this also depends on how the monitoring process is designed/implemented and the platform it's running on), there is not much time (maybe microseconds) before a pipelineRun actually starts running after being created.

If the PipelineRunPending setting is set after a PipelineRun has been created, the validating webhook should reject the update.

Also, this statement is very confusing: compared to transitioning from 5 to 6, if a pipelineRun going into the pending state after creation is considered invalid, how would the watching process know a pipelineRun was initiated?

Overall approach seems reasonable. This will be very interesting to see being implemented 😜

/lgtm

Contributor

Do you mean the new service, e.g. let's call it Concurrency Service, will need to add the "pending" label? You're right that the window to do that in would be impossibly small and racy.

Yup, the Concurrency Service having a very small window.

@jbarrick-mesosphere is already saying this but to be totally clear: we cannot rely on a watching service to add the pending label (b/c the window would be so small and racy, if its even possible) - using this feature would require creating PipelineRuns with spec.status.pending

Member

Thanks a bunch @bobcatfish and @jbarrick-mesosphere

so, the run is created in pending mode and the watching service is responsible for clearing this label so that the run can be scheduled/executed.

What happens if the pipeline or task definition changes between the time the run was sent to pending and the time it was activated to run? Does the run continue with the old or the new definition? Or is it discarded because the definition changed?

When it says the status is set to empty to activate the run, is it the same status field which was used to send it to pending? In that case, isn't it equivalent to creating a new run from the syntax perspective? Of course, the controller will check whether to create a new run or activate the pending run. I am not objecting to this but just making it explicit.

Contributor

What happens if the pipeline or task definition changes between the time it was sent to pending vs the time it was activated to run? The run continues with the old or new definition? or its discarded saying the definition changed?

That's an interesting question - afaik today you can update a PipelineRun in progress and the changes will be picked up and used. E.g. say you were running a PipelineRun with param X set to "foo", used by Task A, then Task B (which runs after Task A). If, while Task A was running, you changed the value of param X to "bar", I think Task B would end up using "bar".

I might be wrong but I don't think we have anything in place that would prevent that from happening today, so I think this question applies globally to Run execution and not just to this feature.

In that case, isn't it equivalent to creating a new run from the syntax perspective?

I think it's very similar to "cancelling" a pipeline run, which is also done by modifying the value of spec.status (i.e. we do not treat this as creating a run, we treat this as a way to communicate the desired state of the existing run)

Member

What happens if the pipeline or task definition changes between the time it was sent to pending vs the time it was activated to run? The run continues with the old or new definition? or its discarded saying the definition changed?

This is a very good question. Depending on what we wanna do, we could prevent that. At "creation" time, in the first "reconciliation" loop, we get the PipelineSpec (from wherever it comes from) and store it into the status (also known as dereferencing). We could then base the rest of the code and the reconciliation loop on the status's pipelineSpec (and taskSpec) as the source of truth.

This would fix a bunch of potential problems too (that exist today): what happens when the referenced definition changes (this also applies to bundles), and it would probably reduce API calls (as we would have everything we need at hand).

Member

I might be wrong but I don't think we have anything in place that would prevent that from happening today, so I think this question applies globally to Run execution and not just to this feature.

I agree the specified example is not specific to this feature but applies globally. Yup, we work on the modified spec, so the changes are picked up. I am more concerned with one scenario specific to this feature: a permanent error, i.e. Task A existed at the time of creating a run in pending mode, but at execution time, if Task A is missing, the controller would exit with missingTaskDef even though pr.Status.Spec would actually have that definition. This is very rare and applies more to this feature.

I like what @vdemeester is suggesting: consider pr.Status.Spec the source of truth, which is more visible in this feature and bundles (hopefully not fetching bundles on every reconciliation loop, though I am not able to recall how it's implemented).

I think it's very similar to "cancelling" a pipeline run, which is also done by modifying the value of spec.status (i.e. we do not treat this as creating a run, we treat this as a way to communicate the desired state of the existing run)

Right, but the existing proposal does not have any special status to cancel the pending state and resume executing the run. spec.Status is set to cancelled to cancel an existing run, spec.Status is set to pending to create a run in a pending state, and spec.Status is set to empty `` to resume a pending run?
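The transitions being questioned can be laid out side by side (the field value spellings below are illustrative; the TEP fixes the exact strings):

```yaml
# spec.status values and their effect (illustrative spellings):
spec:
  status: "PipelineRunPending"    # set at creation; reconciler ignores the run
---
spec:
  status: ""                      # cleared; reconciler starts/resumes the run
---
spec:
  status: "PipelineRunCancelled"  # cancels a run that is in progress
```

So resuming is not a new status of its own — it is the removal of the pending value, mirroring how cancellation already works through the same field.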

Also, something more to consider: how long can a PipelineRun stay in the pending state? A controller can create many such runs and forget about them. What happens if those runs are never resumed? This might seem out of scope here, but it can fill up etcd quickly and impact the controller.

These are all very important and useful discussions and I appreciate all of your patience 🙏

@pritidesai
Member

pritidesai commented Nov 4, 2020

Overall, I would like to understand the state transition/movement in the proposal; otherwise looks good.

Also, I do not quite understand the benefit of creating pending PipelineRuns, but we clearly have a use case which needs this feature, so let's get this in 😄

@bobcatfish bobcatfish changed the title TEP-0015 - Add a pause setting to Tekton PipelineRun and TaskRuns TEP-0015 - Add a ~pause~ pending setting to Tekton PipelineRun and TaskRuns Nov 9, 2020
@jbarrick-mesosphere jbarrick-mesosphere changed the title TEP-0015 - Add a ~pause~ pending setting to Tekton PipelineRun and TaskRuns TEP-0015 - Add a pending setting to Tekton PipelineRun and TaskRuns Nov 10, 2020
update motivations text

Narrow document scope to just PipelineRuns

Specify changing the Succeeded Condition's Reason to Paused.

change paused to pending

Add clarity on state transitions

add note about pipelinerun timeouts
@tekton-robot tekton-robot merged commit 7360755 into tektoncd:master Nov 12, 2020
@jbarrick-mesosphere
Contributor Author

I've put up a PR to implement this here: tektoncd/pipeline#3522

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
10 participants